Add OpenTelemetry tracing across the backbeat pipeline #2733

Open
delthas wants to merge 6 commits into development/9.4 from improvement/BB-764/otel-replication-tracing

delthas (Contributor) commented Apr 14, 2026

Summary

Add OpenTelemetry tracing to backbeat so async work (replication, lifecycle,
GC, notifications) can be traced back to the original S3 request in Jaeger.
Arsenal already stamps value.traceContext.{traceparent,tracestate} onto every
MongoDB metadata write; the oplog carries it; this PR extracts it and
propagates it across backbeat's Kafka pipeline.

Gated by ENABLE_OTEL=true: when unset, init() returns before loading any
@opentelemetry/* package (no tracing overhead when OTEL is off).

Design and structure mirror the cloudserver OTEL PR
(scality/cloudserver#6140,
CLDSRV-884), which went through full human review.

Commits

Reviewable in order — each loads on its own and builds on the previous:

  1. chore: add OpenTelemetry dependencies — explicit @opentelemetry/* packages, no auto-instrumentations-node bundle.
  2. feat: add OTEL trust-boundary host filter (lib/tracing/trustedHosts.js, fed by operator-supplied OTEL_TRUSTED_HOSTS).
  3. feat: add OTEL SDK bootstrap and tracing facade (lib/tracing/index.js: init/close/isEnabled, sampler, span limits, traces-only, outbound-only HTTP, mongodb/ioredis/aws-sdk instrumentation, bounded shutdown flush).
  4. feat: propagate trace context across the Kafka pipeline (kafkaTraceContext.js, producer 7th-arg headers, publish() headers param, consumer span using links, never parent-child).
  5. feat: instrument replication and oplog-populator pods.
  6. feat: instrument lifecycle, GC, and notification pods.

What it does

  1. lib/tracing/ facade — public surface is init() / close() /
    isEnabled(); OTEL internals are hidden. init() fails fast (assert) if
    ENABLE_OTEL=true and OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is unset, and on
    an out-of-range OTEL_SAMPLING_RATIO. No baked-in endpoint default.
    Traces-only NodeSDK (logRecordProcessors:[] / metricReaders:[]),
    explicit span limits, ParentBasedSampler({ root: TraceIdRatioBased(ratio) })
    (default 1%) so a sampled upstream trace is always honored.

  2. Explicit instrumentations: instrumentation-http +
    instrumentation-aws-sdk + instrumentation-ioredis (with
    requireParentSpan) + instrumentation-mongodb (with
    enhancedDatabaseReporting:false for PII masking). No
    auto-instrumentations-node bundle (avoids ~36 unused instrumentations,
    version skew, and a transitive @types/pg conflict). MongoDB is
    instrumented because several pods use the mongodb driver directly —
    oplog-populator (MongoLogReader tailing the oplog), lifecycle conductor
    (scan-path collection queries), and notification (MongoConfigManager).

  3. HTTP: outbound only. instrumentation-http runs with
    disableIncomingRequestInstrumentation: true. No instrumented pod serves
    application HTTP — they are Kafka producers (oplog-populator, conductor) and
    consumers (replication, lifecycle, GC, notification processors); the only
    inbound HTTP is k8s probes / Prometheus scrapes, never a useful trace entry.
    So server spans are never created, which also removes the need to maintain a
    health-path ignore list. (The backbeat API server in bin/backbeat.js has
    real routes and is not instrumented here; instrumenting it later must
    re-enable server spans.)

  4. Trust boundary: lib/tracing/trustedHosts.js. Trusted hosts come from
    an operator-supplied OTEL_TRUSTED_HOSTS env var (comma-separated lowercase
    bare hostnames); loopback is always trusted. Outbound calls to untrusted
    hosts (replication destinations, AWS/Azure/GCP, remote Artesca) have
    traceparent/tracestate stripped and the client span tagged
    scality.trace.suppressed. Unset env → loopback-only (safe default).
    Emitting the var for backbeat pods is an operator follow-up (cf. cloudserver's
    ZKOP-551).

  5. Kafka propagation = span Links — every consumer read starts a new
    trace (via ROOT_CONTEXT) and adds an OTEL Link to the upstream span (never
    parent-child). Async work fires minutes/hours after the S3 request; links
    keep each trace small and navigable via Jaeger's link UI instead of
    producing million-span, multi-hour waterfalls. Producer/consumer header
    plumbing rides node-rdkafka's MessageHeader[] (array of single-key
    objects). Consumer spans set the OTEL messaging semantic-convention
    attributes (messaging.system, messaging.destination.name,
    messaging.destination.partition.id, messaging.consumer.group.name) and
    are marked ERROR on task failure.

  6. Graceful flush — each pod's SIGTERM path calls tracing.close(), which
    is race-safe and bounded at 5s (Promise.race + .unref()'d timer) so an
    unreachable collector can't block past Kubernetes' 30s grace period.
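
The bounded flush in item 6 can be sketched as follows. This is an illustration of the Promise.race plus unref()'d timer pattern, not the PR's implementation: `boundedShutdown` and its result strings are hypothetical, and `sdk` is assumed to be any object exposing a promise-returning `shutdown()`.

```javascript
'use strict';

// Hypothetical helper: settles when `shutdown()` settles or after
// `timeoutMs`, whichever comes first. It never rejects.
function boundedShutdown(shutdown, timeoutMs = 5000) {
    let timer;
    const timeout = new Promise(resolve => {
        timer = setTimeout(() => resolve('timeout'), timeoutMs);
        // An unref()'d timer does not keep the process alive on its own.
        timer.unref();
    });
    const flush = Promise.resolve()
        .then(shutdown) // also catches a synchronous throw from shutdown()
        .then(() => 'flushed', () => 'failed');
    return Promise.race([flush, timeout]).finally(() => clearTimeout(timer));
}

// Memoizing the promise makes repeated close() calls share one flush.
let closing = null;
function close(sdk, timeoutMs = 5000) {
    if (!closing) {
        closing = boundedShutdown(() => sdk.shutdown(), timeoutMs);
    }
    return closing;
}
```

A SIGTERM handler would then await `close(sdk)` before exiting; with the 5s cap, a hung exporter delays shutdown by at most that bound.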

Pods instrumented

oplog-populator, replication (data + status), lifecycle (conductor, bucket,
object/transition), GC, notification. Entry points call
require('../lib/tracing').init() before any HTTP/aws-sdk/ioredis/mongodb
module loads.

Compatibility note

Rebased onto development/9.4 (node-rdkafka ^2.12.0 → ^3.6.0). Verified the
produce() 7th-arg headers signature and MessageHeader[] format are
unchanged in v3.
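
For reference, a hedged sketch of the header plumbing this note refers to: building the array of single-key objects passed as the headers argument, reading it back on the consumer side, and parsing the W3C traceparent into the fields a span Link needs. The function names are hypothetical (they are not taken from kafkaTraceContext.js), and the sketch assumes consumed header values may arrive as Buffers.

```javascript
'use strict';

// W3C traceparent: version-traceid-parentid-flags, lowercase hex.
const TRACEPARENT_RE =
    /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Build MessageHeader[]-style headers: one single-key object per header.
function toKafkaHeaders(traceContext) {
    const headers = [];
    if (traceContext && traceContext.traceparent) {
        headers.push({ traceparent: traceContext.traceparent });
        if (traceContext.tracestate) {
            headers.push({ tracestate: traceContext.tracestate });
        }
    }
    return headers;
}

// Read the two trace headers back, decoding Buffer values if present.
function fromKafkaHeaders(headers) {
    const out = {};
    for (const header of headers || []) {
        for (const [key, value] of Object.entries(header)) {
            if (key === 'traceparent' || key === 'tracestate') {
                out[key] = Buffer.isBuffer(value)
                    ? value.toString('utf8')
                    : String(value);
            }
        }
    }
    return out;
}

// Returns { traceId, spanId, traceFlags } or null when the value is invalid
// (bad shape, version ff, or an all-zero trace or parent id).
function parseTraceparent(value) {
    const m = TRACEPARENT_RE.exec(value || '');
    if (!m || m[1] === 'ff' || /^0+$/.test(m[2]) || /^0+$/.test(m[3])) {
        return null;
    }
    return { traceId: m[2], spanId: m[3], traceFlags: parseInt(m[4], 16) };
}
```

A consumer would feed the parsed object into a Link on a span started from a fresh root context, and skip the link when `parseTraceparent` returns null.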

Tests

Unit tests under tests/unit/lib/tracing/ (index boot/close/isEnabled +
fail-fast asserts, trustedHosts env parsing + request-hook strip/IPv6,
kafkaTraceContext) plus notification populator trace-header propagation.
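
The cases the trustedHosts tests cover (env parsing, header stripping, IPv6, missing Host) can be illustrated with a small sketch. It is written under assumptions: the loopback set below is a guess at what "loopback" covers, and `parseTrustedHosts`, `bareHost`, and `outboundHeaders` are hypothetical names rather than the module's exports.

```javascript
'use strict';

// Assumption: "loopback" means these three names; the PR does not list them.
const LOOPBACK_HOSTS = ['localhost', '127.0.0.1', '::1'];

// Parse a comma-separated OTEL_TRUSTED_HOSTS value into a lowercase set.
function parseTrustedHosts(raw) {
    const trusted = new Set(LOOPBACK_HOSTS);
    for (const entry of (raw || '').split(',')) {
        const host = entry.trim().toLowerCase();
        if (host) {
            trusted.add(host);
        }
    }
    return trusted;
}

// Reduce a Host header to a bare hostname: drop the port and IPv6 brackets.
function bareHost(hostHeader) {
    const value = (hostHeader || '').trim().toLowerCase();
    if (value.startsWith('[')) {
        const end = value.indexOf(']');
        return end === -1 ? '' : value.slice(1, end);
    }
    const first = value.indexOf(':');
    // Exactly one colon means host:port; several mean a bare IPv6 address.
    return first !== -1 && first === value.lastIndexOf(':')
        ? value.slice(0, first)
        : value;
}

// Strip trace headers for untrusted hosts and report the suppression.
function outboundHeaders(hostHeader, headers, trusted) {
    if (trusted.has(bareHost(hostHeader))) {
        return { headers, suppressed: false };
    }
    const { traceparent, tracestate, ...rest } = headers;
    return { headers: rest, suppressed: true };
}
```

In this sketch a missing Host header reduces to the empty string, which is never in the set, so the request is treated as untrusted.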

Status

Reworked to mirror the reviewed design of cloudserver #6140; earlier automated
review comments triaged and resolved; split into 6 reviewable commits. Cluster
end-to-end verification with Jaeger + OTEL collector is the remaining open item.

Behavioral note: replicationStatusProcessor's SIGTERM handler now
force-exits with process.exit(0) after the trace flush. The original had no
success-path exit (it relied on the event loop draining naturally); this aligns
it with the other seven pods and makes shutdown deterministic. Intentional —
flagged explicitly so reviewers don't mistake it for an accidental change.

Issue: BB-764

bert-e (Contributor) commented Apr 14, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

| name | description | privileged | authored |
|---|---|---|---|
| /after_pull_request | Wait for the given pull request id to be merged before continuing with the current one. | | |
| /bypass_author_approval | Bypass the pull request author's approval | | |
| /bypass_build_status | Bypass the build and test status | | |
| /bypass_commit_size | Bypass the check on the size of the changeset TBA | | |
| /bypass_incompatible_branch | Bypass the check on the source branch prefix | | |
| /bypass_jira_check | Bypass the Jira issue check | | |
| /bypass_peer_approval | Bypass the pull request peers' approval | | |
| /bypass_leader_approval | Bypass the pull request leaders' approval | | |
| /approve | Instruct Bert-E that the author has approved the pull request. | | ✍️ |
| /create_pull_requests | Allow the creation of integration pull requests. | | |
| /create_integration_branches | Allow the creation of integration branches. | | |
| /no_octopus | Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead | | |
| /unanimity | Change review acceptance criteria from one reviewer at least to all reviewers | | |
| /wait | Instruct Bert-E not to run until further notice. | | |

Available commands

| name | description | privileged |
|---|---|---|
| /help | Print Bert-E's manual in the pull request. | |
| /status | Print Bert-E's current status in the pull request TBA | |
| /clear | Remove all comments from Bert-E from the history TBA | |
| /retry | Re-start a fresh build TBA | |
| /build | Re-start a fresh build TBA | |
| /force_reset | Delete integration branches & pull requests, and restart merge process from the beginning. | |
| /reset | Try to remove integration branches unless there are commits on them which do not appear on the source branch. | |

Status report is not available.

bert-e (Contributor) commented Apr 14, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 82.25806% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.97%. Comparing base (c52fbcc) to head (6357120).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...tensions/lifecycle/conductor/LifecycleConductor.js | 42.30% | 15 Missing ⚠️ |
| bin/queuePopulator.js | 0.00% | 2 Missing ⚠️ |
| extensions/gc/service.js | 0.00% | 2 Missing ⚠️ |
| extensions/lifecycle/bucketProcessor/task.js | 0.00% | 2 Missing ⚠️ |
| extensions/lifecycle/conductor/service.js | 0.00% | 2 Missing ⚠️ |
| extensions/lifecycle/objectProcessor/task.js | 0.00% | 2 Missing ⚠️ |
| extensions/notification/queueProcessor/task.js | 0.00% | 2 Missing ⚠️ |
| extensions/replication/queueProcessor/task.js | 0.00% | 2 Missing ⚠️ |
| ...ons/replication/replicationStatusProcessor/task.js | 0.00% | 2 Missing ⚠️ |
| lib/queuePopulator/QueuePopulatorExtension.js | 50.00% | 1 Missing ⚠️ |

... and 1 more

| Files with missing lines | Coverage Δ | |
|---|---|---|
| extensions/lifecycle/tasks/LifecycleTask.js | 91.65% <100.00%> (+0.10%) | ⬆️ |
| ...ensions/notification/NotificationQueuePopulator.js | 98.21% <100.00%> (+0.03%) | ⬆️ |
| ...cation/destination/KafkaNotificationDestination.js | 81.25% <100.00%> (+2.18%) | ⬆️ |
| extensions/replication/ReplicationAPI.js | 87.50% <100.00%> (+0.83%) | ⬆️ |
| ...xtensions/replication/ReplicationQueuePopulator.js | 91.93% <100.00%> (-1.40%) | ⬇️ |
| lib/BackbeatConsumer.js | 95.41% <100.00%> (+0.52%) | ⬆️ |
| lib/BackbeatProducer.js | 90.17% <ø> (ø) | |
| lib/tracing/kafkaTraceContext.js | 100.00% <100.00%> (ø) | |
| lib/tracing/trustedHosts.js | 100.00% <100.00%> (ø) | |
| lib/queuePopulator/QueuePopulatorExtension.js | 97.22% <50.00%> (+6.04%) | ⬆️ |

... and 10 more

... and 5 files with indirect coverage changes

| Components | Coverage Δ | |
|---|---|---|
| Bucket Notification | 80.30% <80.00%> (+0.07%) | ⬆️ |
| Core Library | 81.76% <98.36%> (+0.78%) | ⬆️ |
| Ingestion | 70.63% <ø> (-0.61%) | ⬇️ |
| Lifecycle | 78.67% <46.51%> (-0.39%) | ⬇️ |
| Oplog Populator | 85.83% <ø> (ø) | |
| Replication | 59.72% <55.55%> (-0.07%) | ⬇️ |
| Bucket Scanner | 85.76% <ø> (ø) | |

@@                 Coverage Diff                 @@
##           development/9.4    #2733      +/-   ##
===================================================
+ Coverage            74.73%   74.97%   +0.23%     
===================================================
  Files                  199      202       +3     
  Lines                13650    13824     +174     
===================================================
+ Hits                 10201    10364     +163     
- Misses                3439     3450      +11     
  Partials                10       10              

| Flag | Coverage Δ | |
|---|---|---|
| api:retry | 9.01% <0.00%> (-0.12%) | ⬇️ |
| api:routes | 8.83% <0.00%> (-0.12%) | ⬇️ |
| bucket-scanner | 85.76% <ø> (ø) | |
| ft_test:queuepopulator | 10.94% <9.13%> (+0.81%) | ⬆️ |
| ingestion | 12.46% <7.52%> (-0.12%) | ⬇️ |
| lib | 7.76% <9.13%> (-0.02%) | ⬇️ |
| lifecycle | 18.93% <22.58%> (-0.07%) | ⬇️ |
| notification | 1.01% <0.00%> (-0.02%) | ⬇️ |
| oplogPopulator | 0.14% <0.00%> (-0.01%) | ⬇️ |
| replication | 18.61% <9.67%> (-0.11%) | ⬇️ |
| unit | 51.62% <75.26%> (+0.39%) | ⬆️ |

Flags with carried forward coverage won't be shown.

delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 9d08f7b to 2f7afb0, April 14, 2026 10:44
delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 2f7afb0 to d562a0a, April 14, 2026 10:47
delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from d562a0a to 51a9f61, April 14, 2026 11:00
delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 51a9f61 to b9d3528, April 14, 2026 16:07
delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 970a811 to 849d6b0, April 15, 2026 15:22
scality deleted a comment from claude Bot, May 13, 2026
scality deleted a comment from claude Bot, May 13, 2026
scality deleted a comment from claude Bot, May 13, 2026
scality deleted a comment from claude Bot, Jun 1, 2026
delthas force-pushed the improvement/BB-764/otel-replication-tracing branch 2 times, most recently from ea6bda3 to 078df44, June 1, 2026 13:34
scality deleted a comment from claude Bot, Jun 1, 2026
scality deleted a comment from claude Bot, Jun 1, 2026
delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 078df44 to 6357120, June 1, 2026 13:48

claude Bot commented Jun 1, 2026

Well-structured PR — clean OTEL facade, correct trust-boundary handling, proper span lifecycle management with sync-throw guards, and good test coverage. No correctness bugs found. A few observations:

- extensions/replication/replicationStatusProcessor/task.js:74 — Behavioral change: adds process.exit(0) on the success SIGTERM path where the original relied on event-loop drain. Documented in the PR description but worth reviewer awareness.
- lib/tracing/index.js:10 — isEnabled() checks the env var, not sdk state. After close() nulls sdk, the consumer hot path still enters the span code path (safe due to OTEL no-op tracers, but spans created during shutdown are silently dropped).
- extensions/notification/destination/KafkaNotificationDestination.js:109 — Object spread on every message in send() even when no headers present; minor perf consideration.
- lib/BackbeatConsumer.js:537 — @opentelemetry/api is unconditionally loaded at module level even when OTEL is disabled. Lightweight package, but adds to startup. Conscious trade-off per the design.
- lib/tracing/trustedHosts.js:38 — Missing Host header correctly treated as untrusted (empty string not in trust set). Test coverage confirms this edge case.

Overall: solid implementation that mirrors the reviewed cloudserver OTEL design. The gating via ENABLE_OTEL, explicit instrumentations (no auto-bundle), link-based Kafka propagation, and bounded shutdown flush are all well-considered.

Review by Claude Code

scality deleted a comment from claude Bot, Jun 1, 2026
delthas marked this pull request as ready for review, June 1, 2026 13:59
delthas requested review from a team, DarkIsDude and SylvainSenechal, June 1, 2026 13:59
delthas added the claude-review-retro label (PRs with a Claude Code review that could be improved), Jun 1, 2026